Overview and Motivation

There is no denying the importance of education. At the individual level, a strong education is important for personal growth, acquiring knowledge, and developing critical thinking skills. At the societal level, educational outcomes are highly correlated to better health and higher earnings. However, in the United States, there are many factors that can influence the quality of education and students’ educational success. Understanding how these socioeconomic and political factors affect academic achievement outcomes can better inform future educational policies and programs.

Our group was inspired to pursue this project after many of us took a statistics course at Harvard Graduate School of Education during the Spring of 2021. For this class we often applied the methods we were learning to data from The Education Opportunity Project at Stanford University. This dataset, referred to as the Stanford Education Data Archive (SEDA) provides detailed information on educational outcomes by various levels (school, district, county, community, and state). We were motivated to use this first national database of academic performance to ask questions about predictors of academic achievement and school funding between states and at the national level.

In this section of the project (Part 1) we are specifically aiming to answer the following questions:

Throughout this RMarkdown, we will discuss in depth how our though process surrounding these questions were developed and how they evolved throughout the course of the project. To summarize, the results of exploratory data analysis and machine learning models helped to inform areas of focus in our project.

Part 1: Predicting District Academic Achievement

In the first part of our project, we aim to understand the variables that predict academic achievement on a district level. This section of our project has three components. First, we will use exploratory data analysis and machine learning modeling to determine key predictors of academic achievement on which to focus. Next we will take a closer look at some socio-economic status, unemployment, and poverty as predictors of academic achievement, using Washington and Massachusetts as case studies. Finally, we will explore how receiving free lunch at school predicts academic success.

Data Source:

In our project, we use the SEDA dataset that maps student achievement based on different district level covariates. The Educational Opportunity Project at Stanford University measures educational opportunity in diverse communities all over America. We loaded the file from the publicly available datasets that the SEDA project has online here We use the inner join function to join the achievement scores dataset and the covariates dataset.

# Load achievement data (district, CS, pooled) 
# data obtained from: "https://stacks.stanford.edu/file/druid:db586ns4974/seda_geodist_pool_cs_4.0.dta"
# This data set contains data on the outcome of interest, district academic achievement.
# In this data set, academic achievement is measured in terms of standard deviation 
# units and is what we will be using for our regression analyses
ach <- read_dta("seda_geodist_pool_cs_4.0.dta")  

# Load main covariates data (district level), pool 
# data obtained from https://stacks.stanford.edu/file/druid:db586ns4974/district%20covariates.dta"
# This data set contains an extensive list of covariates that will be tested as
# predictors of our outcome, academic achievement
covs <- read_dta("seda_cov_geodist_pool_4.0.dta")


# Merge files together. This keeps only matched districts.
dat <- inner_join(ach, covs, by = c("sedalea", "fips"))


# Subset to the "all" subgroup estimates for all students, only districts
# with an estimate of average achievement, unemployment, poverty, ses avergage, and non-missing SES.
dat <- filter(dat,
              subgroup == "all",
              !is.na(cs_mn_avg_ol),
              !is.na(unempavgall),
              !is.na(povertyavgall),
              !is.na(sesavgall))
nrow(dat)
## [1] 12438

After creating this data set, we noticed that the covariate data set that we used did not include variables related to district spending. This is an area of interest of our team, as it shows how policies and funding can play a round in academic outcomes. Therefore, obtained an additional SEDA dataset with district spending included and merged this with the data set above.

#importing the data set that includes variables of interest: ppexp_tot, ppexp_inst, pprev_tot
spending <- read_dta("district covariates.dta")  

# Subsetting data set to only include variables of interest
# also including the district ID, titled leaid here, and state ID (fips) for joining
spending <- subset(spending, select = c(ppexp_tot, ppexp_inst, pprev_tot, leaid, fips))

# since the name of the district ID variable in the dataset "spending" (leaid) is different
# from the district ID variable in the data set "dat" (sedalea), we need to change this

spending <- rename(spending, sedalea = leaid)

#checking the class of `sedalea` in "spending" df 
class(spending$sedalea)
## [1] "character"
#checking the class of `sedalea` in "dat" df

class(dat$sedalea)
## [1] "numeric"
#converting `sedalea` in "spending" df to numeric class so that they are both of the same class
spending$sedalea <- as.numeric(spending$sedalea)

#confirming that `sedalea` in "spending" is now numeric
class(spending$sedalea)
## [1] "numeric"
# Merge files together. This keeps only matched districts.
dat <- inner_join(dat, spending, by = c("sedalea", "fips"))

Part 1a: Deciding the predictors of interest

Explorotory Analysis

For the exploratory data analysis, we start by summarizing our outcome of interest: district academic achievement. As mentioned in the code above, the data set that we imported for our measures academic achievement in terms of standard deviation units away form the national mean. While we can compare relative performance using this measure, interpreting standard deviation units for exploratory data analysis is not too useful, and can be difficult to interpret. Therefore, we will import another SEDA dataset that measures academic achievement outcomes in terms of “grade levels” using the variable gcs_mn_avg_ol.

As with cs_mn_avg_ol, this variable indicates the average math and reading scores for grades 3 through 9 from academic years 2008-2009 to 2015-2016. This scale, however, is in “grade levles” where 5.5 is average across all districts (mid point of grades 3 and 8, the grades that participate in testing). Using this scale means that if a district has an average grade level of 7.5, their students math and reading scores are, on average, similar to students who are one grade level higher, when compared to an average district in the nation.

# loading SEDA data that includes academic achievement in "grade levels"

ach.EDA <- read_dta("seda_geodist_pool_gcs_4.0.dta")  

# Restricting the data set to only observations in the subgroup "all" and that does not have missing values for gcs_mn_avg_ol
ach.EDA <- filter(ach.EDA,
              subgroup == "all",
              !is.na(gcs_mn_avg_ol))

To begin our exploratory data analysis on our outcome, it would be helpful to first take a look at how much variation we have in our data. We can visualize this first by using a histogram.

# Creating a histogram of district academic achievement as measured by average "grade levels"ggplot(aes(x=gcs_mn_avg_ol), data = ach.EDA) +
ggplot(aes(x=gcs_mn_avg_ol), data = ach.EDA) +
  geom_histogram(color="black", fill="blueviolet") +
  xlab("Average Test Scores (Grade Levels)") +
  ylab("Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the histogram above, we see that there is decent amount of variation in the district average achievement scores, as measured in “Grade Levels.” For example, some districts have average test scores of around 2.5– that means that district is testing 3 grade levels below the national average of 5.5! On the other hand, there are also some very high achieving districts. For example, some districts have average test scores at around 7.5 grade levels, 2.5 grade levels above the national average.

Our team is hoping to understand some of the district-level factors that leads to such variation in academic performance in the United States.

Before moving onto our analysis, we can also look at the variation in test scores between and within states using box plots:

ach.EDA %>% mutate(stateabb = reorder(stateabb, gcs_mn_avg_ol, FUN = median)) %>% 
  ggplot(aes(stateabb, gcs_mn_avg_ol, group = stateabb)) + 
  geom_boxplot(fill = "lightsalmon1", outlier.color = "black", outlier.fill = "lightsalmon1", outlier.shape = 21) +
  labs(x = "State Abbreviation", y = "Average District Test Scores (Grade Levels)") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

From the graph above, we notice there is a lot of variation in test scores both between and within states. Districts in Massachusetts have the highest median test score, performing at about 1.5 grade levels above the national average while New Mexico has the lowest median achievement scores with a median of around 0.5 grade levels below the national average. With that being said, within most states there is a lot of variation in performance. Within a given state, take California for example, you can find districts with both extremely high and extremely low test scores.

In our analysis, we hope to understand the factors that contribute to such variation between districts across the United States. The covariate data set has many different variables, many of which may predict district test scores. However, for the purpose of our project, we want to focus in on a few key covariates in order to produce more focused and meaningful conclusions.

To decide our priority predictors, we can start of by visually summarizing the overall relationships between these variables and our outcome for academic achievement, cs_mn_avg_ol. In contrast our achievement variable used in the plots above, cs_mn_avg_ol measures academic achievement in terms of standard deviations away from the national average. explain this variable more here and what it means

Before we begin, we need to specify our list of district-level predictors to choose from. Our current data set currently has approximately 60 covariates. However, some of these covariates are only applicable to some sub-populations. For example, for average SES, the data set includes a district-level measure and a measure broken down into different racial groups. For this project, we are only interested in the district-level measures.

Our final list of covariates that we will be analyzing is presented below:

dat.ML <- subset(dat, select = c(urban, suburb, town, rural, perind, perasn, perhsp, perblk, perwht, perfl, perrl, perfrl, perecd, perell, perspeced, totenrl, sesavgall, lninc50avgall, baplusavgall, povertyavgall, unempavgall, snapavgall,  single_momavgall, ppexp_tot, ppexp_inst, pprev_tot, cs_mn_avg_ol))
#running missForest to impute missing data
#data.ML.imputed <- missForest(as.matrix(dat.ML))

#converting output back to data frame to be used
#data.ML.imputed.df <- as.data.frame(data.ML.imputed$ximp)

#exporting the data frame with imputed values to be used in the future
#write.csv(data.ML.imputed.df, file = "data_ML_imputed.csv", row.names = FALSE)
  • urban: district is in a city/urban locale (binary; 0 = no, 1 = yes)
  • suburb: district is in a suburban locale (binary; 0 = no, 1 = yes)
  • town: district is in a town locale (binary; 0 = no, 1 = yes)
  • rural: district is in a rural locale (binary; 0 = no, 1 = yes)
  • perind: % Native American students in the district
  • perasn: % Asian students in the district
  • perhsp: % Hispanic students in the district
  • perblk: % Black students in the district
  • perwht: % White Students in the district
  • perfl: % students receiving free lunch
  • perrl: % students receiving reduced lunch
  • perfrl: % students receiving free or reduced lunch
  • perecd: % economically disadvantaged students in the district
  • perell: % English Language Learners
  • perspeced: % students who receive Special Education
  • totenrl: # students in the grade
  • sesavgall: SES composite score
  • lninc50all: log of median income
  • baplusall: % of adults in district with at least a bachelors degree (percentage)
  • povertyall: poverty rate (%)
  • unempall: unemployment rate (%)
  • snapall: SNAP recipient rate (%)
  • single_momall: % of household with children with single mom
  • ppexp_tot: revenue per pupil (dollars)
  • ppexp_inst: total per pupil expenditures on instruction (dollars)
  • pprev_tot: revenue per pupil (dollars)
  • gsi
  • cs_mn_avg_ol: District test-based achievement math and reading scores from ordinary least squares estimate

For the continuous predictors, we can do scatter plots:

ML.dta <- read.csv("data_ML_imputed.csv")

p1 <- ggplot(ML.dta, aes(x = perind, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% Native American Students") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p2 <- ggplot(ML.dta, aes(x = perasn, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% Asian Students") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p3 <- ggplot(ML.dta, aes(x = perhsp, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% Hispanic Students") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p4 <- ggplot(ML.dta, aes(x = perblk, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% Black Students") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p5 <- ggplot(ML.dta, aes(x = perwht, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% White Students") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p6 <- ggplot(ML.dta, aes(x = perfl, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% w/ Free Lunch") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p7 <- ggplot(ML.dta, aes(x = perrl, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% w/ Reduced Lunch") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p8 <- ggplot(ML.dta, aes(x = perfrl, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% w/ Free/Reduced Lunch") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p9 <- ggplot(ML.dta, aes(x = perecd, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% Economically Disadvantaged") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p10 <- ggplot(ML.dta, aes(x = perell, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% English Language Learners") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p11<- ggplot(ML.dta, aes(x = perspeced, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% Studnets in Special Ed.") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p12 <- ggplot(ML.dta, aes(x = totenrl, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("# Enrolled in District") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p13 <- ggplot(ML.dta, aes(x = sesavgall, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("SES Score") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p14 <- ggplot(ML.dta, aes(x = lninc50avgall, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("Log  Median Income") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p15 <- ggplot(ML.dta, aes(x = baplusavgall, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% Adults in District w/ BA") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p16 <- ggplot(ML.dta, aes(x = povertyavgall, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("Poverty Rate (%)") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p17 <- ggplot(ML.dta, aes(x = unempavgall, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("Unemployment Rate (%)") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p18 <- ggplot(ML.dta, aes(x = snapavgall, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("SNAP Rate (%)") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p19 <- ggplot(ML.dta, aes(x = single_momavgall, y = cs_mn_avg_ol)) +
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("% Students w/ Single Mom") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p20 <- ggplot(ML.dta, aes(x = ppexp_tot, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("Expenditure Per Pupil (USD)") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

p21 <- ggplot(ML.dta, aes(x = ppexp_inst, y = cs_mn_avg_ol)) + 
  geom_point(size=0.05, shape = ".", alpha = 0.5, color = "blueviolet") + 
  xlab("Instuction Expenditures Per Pupil (USD)") + ylab("District Average Test Scores") + theme_light() + geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,1))

all.predictors.plot <- ggarrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21,
                        nrow = 1, ncol = 3)

For the binary predictor (urban suburb, town, rural), we can use a box plot:

boxplot.dta <- ML.dta %>% mutate(locale = 
                                   case_when(urban == 1 ~ "Urban", 
                                             suburb == 1 ~ "Suburban",
                                             town == 1 ~ "Town",
                                             rural == 1 ~ "Rural"))

ggplot(boxplot.dta, aes(locale, cs_mn_avg_ol, group = locale)) + 
  geom_boxplot(fill = "cyan2", outlier.color = "black", outlier.fill = "cyan2", outlier.shape = 21) + labs(x = "District Locale", y = "District Average Test Scores") + theme_light()

Our above scatter plots confirm what we initially suspected: many of these variables do in fact predict academic achievement. For example, it appears baplus_all, the percent of adults in the district with a bachelor’s degree, is positively correlated with academic achievement. On the other hand, the percent of households in poverty or with a single mother appear to be negatively correlated. For other variables, such as per pupil expenditure (ppexpt_inst), the association is less clear. Overall, while these scatterplots give us a general sense of what variables may be important, it still is not clear which variables matter the most for academic achievement.

To answer this question, we will turn to machine learning models. We will first use random forest to determine variables of importance in predicting academic achievement, as measured by the Gini coeffient. We will then addition use a regression tree to see which variables are included to predict relative test scores.

Using Machine Learning to Determine Predictors of Importance

Creating the Training and Test Sets

To begin our machine learning exercises, we first must great our training and test sets. Here we are using training and test sets that both include 50% of the dataset.

set.seed(20)
#creating a data partition that splits my data frame in half
index_train<- createDataPartition(y = ML.dta$cs_mn_avg_ol, times =1, p=0.05, list = FALSE)

#labeling the first "slice" as the train set
train_set <- slice(ML.dta, index_train)

#labelign the second "slice" as the test set
test_set <- slice(ML.dta, -index_train)

Random Forest

Next, we will use a Random Forest Model to determine variables of importance, as measured using the Gini Coefficient.

set.seed(1)

#fitting a random forest with all 28 predictors
rf_fit <- randomForest(cs_mn_avg_ol ~ urban + suburb + town + rural + perind + 
                                 perasn + perhsp + perblk + perwht + 
                                 perfl + perrl + perfrl + perecd + perell +
                                 perspeced + totenrl +
                                 sesavgall + lninc50avgall +
                                 baplusavgall + povertyavgall +
                                 unempavgall + snapavgall +
                                 single_momavgall + ppexp_tot +
                                 ppexp_inst + pprev_tot, data = train_set, mtry = 28, importance = TRUE)
## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within valid
## range
variable_importance <- importance(rf_fit)

variable_importance <- tibble(Predictor = rownames(variable_importance),
                  Gini = variable_importance[,1]) %>%
                  arrange(desc(Gini))

labels <- c(urban = "Urban District", suburb = "Suburban District", town = "Town District", rural = "Rural District", perind = "% Native American Students", perasn = "% Asian Students",
                             perhsp= "% Hispanic Students",
                             perblk= "% Black Students",
                             perwht= "% White Students",
                             perfl= "% Students w/ Free Lunch",
                             perrl= "% Students w/ Reduced Lunch",
                             perfrl= "% Students w/ Free or Reduced Lunch",
                             perecd= "% Students Economically Disadvantaged",
                             perell= "% English Lanugage Learners",
                             perspeced= "% Students with Special Ed.",
                             totenrl= "Number of Students in District", 
                             sesavgall= "SES Composite Score", 
                             lninc50avgall= "Log  Median Income",
                             baplusavgall= "% Adults in District w/ Bachelors(+) degree", 
                             povertyavgall= "Poverty Rate (%)",
                             unempavgall= "Unemployment Rate (%)",
                             snapavgall= "SNAP Rate (%)",
                             single_momavgall = "% Students w/ Single Mom",
                             ppexp_tot = "Expenditures Per Pupil (USD)",
                             ppexp_inst = "Instuction Expenditures Per Pupil (USD)",
                             pprev_tot= "Revenue Per Pupil (USD)"
            )
variable_importance$labels <- as.character(labels[variable_importance$Predictor])

chart.importance <- ggplot(variable_importance, aes(x = reorder(labels, Gini), y = Gini)) + geom_col(width = 0.5, fill = "darkolivegreen3") + theme_light() + coord_flip() + xlab("Predictor") + ylab ("Gini Coefficient"); chart.importance

From the above outputs we can see two things. First, our random forest model is fairly accurate at predicting test scores with an accuracy of 80.47%. Second, as measured by the Gini coefficient, percent free lunch is the most important predictor of test scores followed by percent of adults who have their bachelors degree. Other important indicators include socio-economic status and poverty.

So based on these analyses, what predictors should we focus on? It is clear from the random forest that some variables are more important for academic success than others. For example, percent of students receiving free lunch has high priority in both the random forest model and the regression tree models so warrants attention in future analyses. We additionally want to balance the outcomes of this exploratory data analysis with the interests of our team members. For example, socio-economic status and related predictors, such as poverty, are often described as strong predictors of academic success (source?). Therefore, it would also be worthwhile to explore these variable further and the nature of their associations with academic achievment outcomes.

feel free to add to or change any of the above to make the transition smoother

Therefore, we explore different covariates of interest that highly correlate with average district academic achievement, relative to the national mean measured in standard deviation units, including unemployment, free lunch, SES, and poverty. In this section we look at poverty, unemployment, and SES composite scores.

Summary statistics and correlations

To reiterate what we was mentioned above, we use a stargazer to observe the mean and standard deviations of each variable and then review the correlation between each of the unemployment, SES scores, poverty, and average academic achievement variables. We see that SES, poverty, and unemployment are all highly correlated with one another as expected. Furthermore, we observe that all three variables are correlated with average academic achievement with SES and average test scores having a correlation of 0.76, unemployment and average academic achievement have a negative correlation of -0.57 and poverty and average academic achievement also having a negative correlation of -0.67.

dat2 <- dat %>%
  group_by(stateabb) %>%
  summarise(ses = mean(sesavgall), poverty = mean(povertyavgall), unemployment = mean(unempavgall))


stargazer(data.frame(dplyr::select(dat, cs_mn_avg_ol, unempavgall, povertyavgall, sesavgall)),
          type="text")
## 
## ==================================================================
## Statistic       N    Mean  St. Dev.  Min   Pctl(25) Pctl(75)  Max 
## ------------------------------------------------------------------
## cs_mn_avg_ol  12,366 0.021  0.334   -2.341  -0.185   0.222   1.248
## unempavgall   12,366 0.060  0.020   0.002   0.047    0.071   0.221
## povertyavgall 12,366 0.126  0.062   0.007   0.081    0.159   0.460
## sesavgall     12,366 0.333  0.846   -4.401  -0.157   0.872   2.910
## ------------------------------------------------------------------
#correlation between outcome and unemployment, poverty, and ses
round(cor(dplyr::select(dat, cs_mn_avg_ol, unempavgall, povertyavgall, sesavgall)), 2)
##               cs_mn_avg_ol unempavgall povertyavgall sesavgall
## cs_mn_avg_ol          1.00       -0.57         -0.67      0.76
## unempavgall          -0.57        1.00          0.65     -0.75
## povertyavgall        -0.67        0.65          1.00     -0.92
## sesavgall             0.76       -0.75         -0.92      1.00

Variation in each variable

After seeing the descriptive statistics of each variable, we plot histograms of unemployment, SES scores, poverty, and average academic achievement to display the distribution of each variable. SES is measured as “sesavgall” which is the SES composite score for all families in the district between 2005-2009 and 2014. “povertyavgall” represents the poverty rate (eb estimate) of all families in the district between 2005-2009 and 2014 and the “unempavgall” represents the unemployment rate (eb estimate) of all families in the district between 2005-2009 and 2014. We create histogram and annotate them with the means and standard deviations of each variable. We see that poverty, academic achievement, and SES all have wider spreads of variation and the distribution of unemployment seems to be more narrow.

Explain more how these scores are interpreted (ie. zero = average national SES)

# How much variation is there in average academic achievement?
ggplot(aes(x=cs_mn_avg_ol), data = dat) +
  geom_histogram(color="black", fill="cyan2", binwidth = 0.05) +
  annotate("text", x=-1.8, y=750, label = "Mean=0.021, SD=0.33",
           size=5, hjust=0, vjust=0) +
  xlab("District Academic Achievment (Test Scores)") +
  ylab("Frequency") +
  theme_classic()

# How much variation is there in unemployment?
ggplot(aes(x=unempavgall), data = dat) +
  geom_histogram(color="black", fill="lightsalmon1", binwidth = 0.005) +
  annotate("text", x=-0.15, y=1010, label = "Mean=0.06 , SD=0.02",
           size=5, hjust=0, vjust=0) +
  xlab("Average Unemployment") +
  ylab("Frequency") +
  theme_classic()

# How much variation is there in poverty?
ggplot(aes(x=povertyavgall), data = dat) +
  geom_histogram(color="black", fill="darkolivegreen3", binwidth = 0.01) +
  annotate("text", x=0.21, y=600, label = "Mean = 0.126, SD=0.062",
           size=5, hjust=0, vjust=0) +
  xlab("Average Poverty") +
  ylab("Frequency") +
  theme_classic()

# How much variation is there in SES?
ggplot(aes(x=sesavgall), data = dat) +
  geom_histogram(color="black", fill="blueviolet", binwidth = 0.1) +
  annotate("text", x=-3.5, y=500, label = "Mean = 0.33, SD=0.85",
           size=5, hjust=0, vjust=0) +
  xlab("Average SES") +
  ylab("Frequency") +
  theme_classic()

Plots of Association

Now we want to utilize ggplot to plot the association between the three socioeconomic factors including unemployment, SES scores, and poverty with average academic achievement. We also obtain a pearson correlation coefficient (r) for each graph and annotate it on the plot. Each point represent a district and the size of each point is the total enrollment between grades 3-8 in that district. Lastly, we draw a non-linear line that follows the trajectory of our data points. In the first plot, we see that there is a negative association between unemployment and average district academic achievement. Our r= -0.57. Similarly in the next plot, we also observe that there is a negative association between poverty and average district academic achievement with an r = -0.67. In the last plot, we see that there is actually a positive association between SES and average district academic achievement with the highest correlation coefficient of r = 0.76.

# What is the association between unemployment and average academic achievement?
r1 <- round(cor(dat$cs_mn_avg_ol, dat$unempavgall),2)
ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha = 0.3, pch=21, color = "black", fill = "lightsalmon1", aes(size = totenrl)) +
  geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
  annotate("text", x=0.03, y=-1.5, hjust=1, vjust=0,
           label = paste0("r=",r1)) +
  scale_size(range = c(1,30)) +
  guides(size="none") +
  ggtitle("Association Between Unemployment and Average Academic Achievement") + theme_light() +
  xlab("Average Unemployment") +
  ylab("District Academic Achievment (Test Scores)")

# What is the association between poverty and average academic achievement?
r2 <- round(cor(dat$cs_mn_avg_ol, dat$povertyavgall),2)
ggplot(aes(x=povertyavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha = 0.3, pch=21, color = "black", fill = "darkolivegreen3", aes(size = totenrl)) +
  geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
  annotate("text", x=0.05, y=-1.5, hjust=1, vjust=0,
           label = paste0("r=",r2)) +
  scale_size(range = c(1,30)) +
  guides(size="none") +
  ggtitle("Association Between Poverty and Average Academic Achievement") + theme_light() +
  xlab("Average Poverty") +
  ylab("District Academic Achievment (Test Scores)")

# What is the association between SES and average academic achievement?
r3 <- round(cor(dat$cs_mn_avg_ol, dat$sesavgall),2)
ggplot(aes(x=sesavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha = 0.3, pch=21, color = "black", fill = "blueviolet", aes(size = totenrl)) +
  geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
  annotate("text", x=-3, y=0.75, hjust=1, vjust=0,
           label = paste0("r=",r3)) +
  scale_size(range = c(1,30)) +
  guides(size=FALSE) +
  ggtitle("Association Between SES and Average Academic Achievement") + theme_light() +
  xlab("Average SES") +
  ylab("District Academic Achievment (Test Scores)")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

Unemployment and Average Academic Achievement

First, we create boxplots to display the range of the median unemployment rates between all 50 states. Here, we observe that South Dakota has the lowest unemployment rate and Mississippi has one of the highest unemployment rates (excluding DC). Then, Montana seems to have similar unemployment rates compared to Massachusetts. We note this to plot four states that we can compare in terms of their unemployment-academic achievement gradient.

# The unemployment distribution between all 50 states ordered by the median
p_unempdist <- dat %>% 
  mutate(stateabb = reorder(stateabb, unempavgall, FUN = median)) %>%
  ggplot(aes(stateabb, unempavgall)) +
  geom_boxplot(color = "darkorchid4") +
  #scale_y_discrete() +
  ylim(0, 0.2) +
  scale_x_discrete() +
  theme(axis.text.x = element_text(angle = 90, hjust = 0.5, size = 7)) +
  ggtitle("Distribution of Unemployment rates for each state") +
  xlab("States") +
  ylab("Unemployment Rate") + theme_light()
p_unempdist
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

Next, we found that according to the Bureau of Labor Statistics, Nebraska has the lowest unemployment rate as of September 2021 at 2.0. Washington and Massachusetts have very similar unemployment rates of 4.9 and 5.2, respectively. Lastly, California has the highest unemployment rate of 7.5. Therefore, we also specifically look at the differences in the unemployment-achievement gradient between Nebraska, Washington, Massachusetts, and California. To start off, we plot the boxplots of unemployment rates for the four states and note the differences in their medians.

# The unemployment distribution between WA vs. MA vs. NE vs. CA
dat %>%
  filter(stateabb == "WA" | stateabb == "MA" | stateabb == "NE" | stateabb == "CA") %>% 
  ggplot() +
  geom_boxplot(aes(x= stateabb, y=unempavgall, fill=stateabb), color = "black", outlier.colour = "navy", outlier.size = 1, notch = FALSE) +
  scale_y_continuous(trans = "sqrt") +
  xlab("") + 
  ylab("Unemployment") +
  theme(legend.position ="none") + theme_light()

Then, in one of our final plots titled “How does the Unemployment-Achievement gradient differ in WA, MA, NE, and CA?”, we notice that CA has a huge range of unemployment rates with the line stretching wide into higher unemployment scores. Meanwhile, the line for Nebraska has a range that stretches into the low unemployment rates. We see that all four lines have negative slopes with higher unemployment rates with lower scores of academic achievement and lower unemployment rates with higher scores of academic achievement. However, when comparing between states, the two lines for CA and NE seem to have very similar slopes that overlap with one another. Massachusetts and Washington have similar ranges of unemployment rates but the line is overall at a higher level. Districts in MA display higher average academic achievement scores in standard deviation units (relative to the national mean) overall compared to CA, NE, and WA.

# How does the unemployment gradient compare in WA vs. MA vs. NE vs. CA?

#Filtering the data set to only include WA, MA, NE, CA
dat_wa_ma_ne_ca <- filter(dat, stateabb %in% c("WA","MA", "NE", "CA"))

ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha = 0.2, pch=21, color = "black", fill = "grey", aes(size = totenrl)) +
  geom_point(alpha=0.7, pch=21, color="black", aes(size=totenrl, fill=stateabb),
             data = dat_wa_ma_ne_ca) +
  geom_smooth(se=F, lwd=0.5, lty=2, method="lm", color="black", aes(fill=stateabb), formula = y~x, 
              data = dat_wa_ma_ne_ca) +
  scale_size(range = c(1,30)) +
  guides(size="none", color="none") +
  labs(fill = "State") +
  ggtitle("Academic Achievementy by Unemployment Rates", subtitle = "Comparing the relationships in WA, MA, NE, and CA") +
  xlab("Average Unemployment") +
  ylab("District Academic Achievment (Test Scores)") + theme_light()

ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha=0.04, pch=21, color="black", aes(size=totenrl, fill=stateabb),
             data = dat_wa_ma_ne_ca, show.legend = FALSE) +
  #geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
  scale_size(range = c(1,30)) +
  geom_smooth(se=F, lwd=1, lty=1, method="lm", aes(color=stateabb), formula = y~x,
              data = dat_wa_ma_ne_ca) +
  ggtitle("Academic Achievementy by Unemployment Rates", subtitle = "WA, MA, NE, and CA") +
  xlab("Average Unemployment") +
  ylab("District Academic Achievment (Test Scores)") +
  labs(color = "State") + theme_light()

Finally, we also plot the gradients for the four states of Massachusetts, Montana, South Dakota, and Mississippi specified earlier. Here, we notice that South Dakota has a wide range of unemployment rates that stretch into lower numbers. MT and SD seem to have similar slopes while MS has a slope that is closer to zero. The unemployment-achievement gradient slope seems to be highest in magnitude (most negative) for MA but again, we notice that the average academic achievement scores are higher overall at all unemployment rates compared to all other states.

# How does the unemployment gradient compare in MS vs. MA vs. SD vs. MT?
dat_ms_ma_sd_mt <- filter(dat, stateabb %in% c("MS","MA", "SD", "MT"))

  ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha=0.04, pch=21, color="black", aes(size=totenrl, fill=stateabb),
             data = dat_ms_ma_sd_mt, show.legend = FALSE) +
  scale_size(range = c(1,30)) +
  geom_smooth(se=F, lwd=1, lty=1, method="lm", aes(color=stateabb), formula = y~x,
              data = dat_ms_ma_sd_mt) +
  ggtitle("How does the Unemployment-Achievement gradient differ in MS, MA, SD, and MT?") +
  xlab("Average Unemployment") +
  ylab("District Academic Achievment (Test Scores)") +
  labs(color = "State") + theme_light()

We note that there are several flaws in solely using unemployment as a covariate. We believe that there could be additional variation happening that is not just coming from unemployment. For further analyses, we can look at additional covariates of interest that may impact academic achievement such as income earned, % of students receiving free lunch, poverty levels, environmental factors, housing, and social policies. We also ask why Massachusetts has significantly higher scores of academic achievement at every level of unemployment compared to several other states. Perhaps this can be due to the educational system and policies in MA and/or the different programs available to school districts in Massachusetts. Other variables that may impact the academic performance of students in each state include family environments (such as domestic violence and abuse), racism, and other economic and social measures.

Is this relationship specific to unemployment and academic achievement?

Furthermore, we chose to plot the academic achievement by SES composite scores for MA and WA to observe whether we will observe a difference in the relationship between SES and academic achievement and the relationship between unemployment and academic achievement. We see that there are some differences in the range and the gradient. By simply looking at the points, we notice that Massachusetts and Washington generally overlap in their relationships but MA does seem to have slightly higher levels of SES.

# How does the SES gradient compare in WA vs. MA?
dat_wa_ma <- filter(dat, stateabb %in% c("WA","MA"))

ggplot(aes(x=sesavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha = 0.2, pch=21, color = "black", fill = "grey", aes(size = totenrl)) +
  geom_point(alpha=0.7, pch=21, color="black", aes(size=totenrl, fill=stateabb),
             data = dat_wa_ma) +
  geom_smooth(se=F, lwd=0.5, lty=2, method="lm", color="black", aes(fill=stateabb),
              data = dat_wa_ma) +
  scale_size(range = c(1,30)) +
  guides(size=FALSE, color=FALSE) +
  labs(fill = "State") +
  ggtitle("How does the SES-achievement gradient differ in WA and MA?") +
  xlab("Average SES") +
  ylab("District Academic Achievment (Test Scores)") + theme_light()
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## `geom_smooth()` using formula 'y ~ x'

Massachusetts and Washington may have similar unemployment rates but may look different on other scales such as housing or poverty or income levels. Therefore, we want to consider other covariates, such as % of free lunch, an indicator that our previous regression tree included as a predictor. We saw that the random forest model mentioned earlier concludes that percent of students receiving free lunch has high priority in influencing academic achievement.

Regression Model Without Interaction Terms

Since we are interested in school-related food-assistance policy, we chose percent free or reduced lunch in the district as our variable of interest for this analysis. While SNAP benefits influence academic achievement outcomes, the program targets individuals and households, not schools directly. Moreover, as shown before, percent free or reduced lunch was a stronger predictor of academic achievement than SNAP benefits. We will estimate the effect of percent free or reduced lunch, socioeconomic status, and geography on academic achievement by running a regression model with percent free or reduced lunch, urban/rural, and socioeconomic status.

set.seed(1)
#regression model no interaction
model <- lm(cs_mn_avg_ol~perfrl+urban+sesavgall, data=dat_free_lunch)
broom::tidy(model)

From our model, we can see that our main predictor (percent free or reduced lunch) as well as the other two covariates (socioeconomic status and geography) have a statistically significant relationship with the outcome variable (academic achievement) at a 5% significance level. Specifically:

This model assumes that that the effect of any individual predictor (percent free or reduced lunch, SES, geography) on the outcome (academic achievement) is independent of the other predictors in the model. However, this may not be a reasonable assumption given that percent free or reduced lunch may have a different effect on academic achievement based on different values of socioeconomic status and urban versus rural settings. Therefore, as opposed to considering the effect of socioeconomic status and geography in isolation, we could see how they interact with perfect free or reduced lunch by adding interaction terms in our model.

Regression Model with Interaction Terms

#regression model with interaction 
model2 <- lm(cs_mn_avg_ol~perfrl+urban+sesavgall+perfrl*urban+perfrl*sesavgall, data=dat_free_lunch)
broom::tidy(model2)

From the interaction model, we can see that our interaction terms are both statistically significant. Hence, we know that the effect of percent free or reduced lunch on academic achievement is different for urban versus rural districts as well as for different values of socioeconomic status. We can show the interaction visually by graphing the relationship between percent free or reduced lunch and academic achievement by urban versus rural districts and for different values of socioeconomic status. However, to better plot and interpret the values of socioeconomic status which is currently a continuous variable, we can create categories of socioeconomic status like they do in the Stanford Education Data Archive Technical Documentation (https://stacks.stanford.edu/file/druid:db586ns4974/seda_documentation_4.0.pdf).

dat_free_lunch <- dat_free_lunch %>%  mutate(ses_cat = case_when(
                                                sesavgall < -2.5 ~"Below 2.5",
                                                 sesavgall >= -2.5 &  sesavgall < -1.5 ~ "Between -2.5 & -1.5",
                                                 sesavgall >= -1.5 &  sesavgall < -0.5 ~ "Between -1.5 & -0.5",
                                                 sesavgall >= -0.5 &  sesavgall < 0.5 ~ "Between -0.5 & 0.5",
                                                 sesavgall  >= 0.5 &  sesavgall < 1.5 ~ "Between 0.5 & 1.5",
                                                 sesavgall >= 1.5 &  sesavgall < 2.5 ~ "Between 1.5 & 2.5",
                                                 sesavgall >= 2.5 ~"Above 2.5"))
library(ggpubr)
F = as.formula("y~x")

lunch_plot<-dat_free_lunch %>% mutate(urban=recode(urban,`1` = "Urban", `0` = "Rural")) %>% ggplot(aes(perfrl, cs_mn_avg_ol, color=as.factor(urban))) + geom_smooth(method = "lm", se=FALSE)+scale_color_manual(values=c("darkolivegreen3","blueviolet"))+labs(x="% Free or Reduced Lunch", y="Academic Achievement", title="Free/Reduced Lunch & Academic Achievement by Geography")+guides(color=guide_legend(title="Geography"))+stat_regline_equation(label.x = 0.75)+theme_light(); lunch_plot
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 590 rows containing non-finite values (stat_smooth).
## Warning: Removed 590 rows containing non-finite values (stat_regline_equation).

ses_plot<-dat_free_lunch %>% drop_na() %>%ggplot(aes(perfrl, cs_mn_avg_ol, color=as.factor(ses_cat))) + stat_smooth(method="lm",fullrange=TRUE, se=FALSE,size=0.65)+scale_color_manual(values=c("darkolivegreen3","blueviolet","black","cyan2","goldenrod","blue","lightsalmon1"))+labs(x="% Free or Reduced Lunch", y="Academic Achievement", title="Free/Reduced Lunch & Academic Achievement by SES")+guides(color=guide_legend(title="SES category"))+stat_regline_equation(label.x = 0.7, label.y.npc=0.9)+theme_light(); ses_plot
## `geom_smooth()` using formula 'y ~ x'

The graphs above show that the relationship between percent free or reduced lunch and academic achievement is in fact different for districts in rural versus urban areas as well as for different values of socioeconomic status. The slope for urban areas is slightly steeper than that of rural areas meaning that the effect of percent free or reduced lunch on academic achievement in districts is greater in urban areas than rural areas. Furthermore, the relationship between percent free or reduced lunch and academic achievement changes more rapidly in urban areas than in rural areas. We can also see that the relationship varies for different categories of socioeconomic status, especially for districts that are at the extremes (2.5 standard deviation units above and 2.5 standard deviation units below the average national socioeconomic status). The relationship between percent free or reduced lunch and academic outcomes is positive for districts that are 2.5 standard deviation units above whereas for all other socioeconomic status categories the relationship is negative. Additionally, it seems that the relationship is particularly strong in districts with SES that are between 1.5 and 2.5 standard deviation units above the average national socioeconomic status. However, there are several other factors that may affect academic achievement such as parental educational background, parenting style, parental occupation, financial security in a household, and the quality of food in a school. Percent free or reduced lunch relates to accessibility of food but not the quality of the food, which is an important difference. Hence, a limitation to this analysis is that it does not explore how these other relevant factors may affect the relationship between academic achievement and percent free or reduced lunch at different levels of SES and urban versus rural settings. While this relationship is evaluated at the district level in this data set, perhaps looking at the effects at the school or county level would have different implications. Moreover, this analysis looked at this relationship at a national level. Analyzing the relationship by districts in a state may also show between state variation. Free or reduced lunch programs also vary by state in terms eligibility and quality of food, which is an important difference that is not considered in this analysis. All in all, the relationship between academic achievement and percent free or reduced lunch is highly complex and nuanced, with several observable and unobservable factors affecting how food assistance policies influence academic achievement outcomes in districts. What the data and this analysis specifically do show is that the relationship between academic achievement and percent free or reduced lunch differs significantly in urban versus rural districts and in different levels of socioeconomic status.